Incremental Least Squares Policy Iteration for POMDPs
Abstract
We present a new algorithm, incremental least squares policy iteration (ILSPI), for finding the infinite-horizon policy for partially observable Markov decision processes (POMDPs). The ILSPI algorithm computes a basis representation of the value function by minimizing the Bellman residual, and it performs policy improvement in reachable belief states. A set of optimal basis functions is determined by the ILSPI to minimize the Bellman residual incrementally, via efficient computations. We show that ILSPI improves the policy successively on a set of most probable belief points sampled from the reachable belief set. Like policy iteration in general, the ILSPI converges within a small number of iterations. Results on four benchmark problems show that the ILSPI is competitive with its value-iteration counterparts in terms of both performance and computational efficiency.

The Infinite Horizon POMDP Problem

• The POMDP is a tuple (S, A, O, T, Ω, R).
• Bellman equation:

  V^\pi(b) = \sum_{s \in S} b(s)\, R(s, \pi(b)) + \gamma \sum_{o \in O} p(o \mid b, \pi(b))\, V^\pi(\tilde{b}_o)   (1)

  ◦ π is a stationary policy producing action a = π(b), ∀ b.
  ◦ V^π(b) is the value of b obtained by following π over the infinite horizon.
  ◦ γ ∈ [0, 1) is a discount factor, and

    \tilde{b}_o(s') = \frac{\sum_{s \in S} b(s)\, T^a_{s s'}\, \Omega^a_{s' o}}{p(o \mid b, a)}   (2)

    p(o \mid b, a) = \sum_{s' \in S} \sum_{s \in S} b(s)\, T^a_{s s'}\, \Omega^a_{s' o}   (3)

• Objective: finding the π that yields the maximum value for each belief state over the infinite horizon.

Policy Iteration for POMDPs

Theorem 1 (Howard-Blackwell policy improvement). Let V^π(b) be the infinite-horizon value function of a stationary policy a = π(b). Define the Q function

  Q^\pi(b, a) = \sum_{s \in S} b(s)\, R(s, a) + \gamma \sum_{o \in O} p(o \mid b, a)\, V^\pi(\tilde{b}_o)   (4)

where b̃_o is defined by (2), and the new policy

  \pi'(b) = \arg\max_a Q^\pi(b, a)   (5)

Then π' is an improved policy over π, i.e.,

  V^{\pi'}(b) \ge V^\pi(b)   (6)

for any belief point (belief state) b.

A policy iteration algorithm iteratively applies the policy improvement theorem to obtain successively improved policies:

• Policy evaluation: computing the value function V^π(b) by solving the Bellman equation (1);
• Policy improvement: improving π according to Theorem 1.

The two steps are iterated until V^π(b) converges for all b.

The Proposed ILSPI for POMDPs

• The incremental least squares policy iteration (ILSPI) performs policy iteration on a finite set of sampled belief states reachable by the POMDP, following the idea of the PBVI algorithm (Pineau, Gordon, & Thrun 2003).
• The ILSPI incrementally reduces the Bellman residual by selecting the optimal basis functions.
• The ILSPI performs policy improvement in a point-wise manner, working on belief points one by one.
• The ILSPI extends least squares policy iteration (LSPI) for MDPs (Lagoudakis & Parr 2002) in two respects: it solves the policy for a POMDP, and it automatically determines the optimal basis functions.

Reachable Belief Points ∪_{t=0}^∞ B_t

The ILSPI focuses on improving the policy on the belief points reachable by the POMDP.

• Let B_0 be a set of initial belief points (states) at time t = 0.
• For time t = 1, 2, ..., let B_t be the set of all possible b̃_o in (2), ∀ b ∈ B_{t−1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o | b, a) > 0.
• Then ∪_{t=0}^∞ B_t is the set of belief points reachable by the POMDP starting from B_0 (see the code sketch below).
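As a concrete reading of the belief update (2)-(3) and of the belief expansion just described, the following Python sketch enumerates the posterior beliefs reachable from a set of starting points. It is a minimal illustration, not the authors' implementation; the array layout assumed here (a transition tensor T[a][s, s'], an observation tensor Omega[a][s', o], beliefs as NumPy vectors) is our own choice for concreteness.

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Eqs. (2)-(3): posterior belief b~_o and the probability p(o | b, a)."""
    unnorm = (b @ T[a]) * Omega[a][:, o]   # sum_s b(s) T^a_{s s'} Omega^a_{s' o}, per s'
    p_o = unnorm.sum()                     # p(o | b, a), Eq. (3)
    return (unnorm / p_o if p_o > 0 else None), p_o

def expand_beliefs(B_prev, T, Omega):
    """One expansion step: every b~_o with p(o | b, a) > 0, for all b in B_prev, a, o."""
    B_next = []
    for b in B_prev:
        for a in range(len(T)):
            for o in range(Omega[a].shape[1]):
                b_next, p_o = belief_update(b, a, o, T, Omega)
                if p_o > 0:
                    B_next.append(b_next)
    return B_next
```

Iterating expand_beliefs from B_0 produces B_1, B_2, ...; because the union can still grow without bound, only a sampled subset is kept, as the pruning step below describes.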
Policy Iteration on ∪_{t=0}^∞ B_t

Theorem 2 (Policy improvement in reachable belief states). Let B_0 be a set of initial belief points (states) at time t = 0. For time t = 1, 2, ..., let B_t be the set of all possible b̃_o in (2), ∀ b ∈ B_{t−1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o | b, a) > 0. Let π be a stationary policy and V^π(b) its infinite-horizon value function. Define the Q function

  Q^\pi(b, a) = \sum_{s \in S} b(s)\, R(s, a) + \gamma \sum_{o \in O} p(o \mid b, a)\, V^\pi(\tilde{b}_o)   (7)

Then the new policy

  \pi'(b) = \arg\max_a Q^\pi(b, a), \quad \forall b \in \cup_{t=0}^{\infty} B_t   (8)

improves over π for any belief state (point) b ∈ ∪_{t=0}^∞ B_t, i.e.,

  V^{\pi'}(b) \ge V^\pi(b), \quad \forall b \in \cup_{t=0}^{\infty} B_t   (9)

Pruning of ∪_{t=0}^∞ B_t

The set ∪_{t=0}^∞ B_t may still be infinitely large. We obtain a manageable belief set by sampling from ∪_{t=0}^∞ B_t. The following figure illustrates how ∪_{t=0}^∞ B_t is pruned to generate a finite set of belief points B.

[Figure: belief expansion, showing the belief points at t = 0 and the expanded belief points at t = 1.]

Code sketches of the point-wise improvement and of a least-squares policy evaluation on such a sampled belief set are given below.
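Continuing the same sketch, the point-wise improvement of Theorem 2 (equations (7)-(8)) evaluates Q^π(b, a) at each sampled belief point and keeps the maximizing action. The snippet below reuses belief_update from the sketch above and assumes a reward matrix R[s, a] and a callable V giving the current value-function approximation at a belief; it is a schematic illustration, not the authors' code.

```python
def q_value(b, a, V, R, T, Omega, gamma):
    """Eq. (7): Q^pi(b, a) = sum_s b(s) R(s, a) + gamma * sum_o p(o|b,a) V(b~_o)."""
    q = float(b @ R[:, a])                          # expected immediate reward
    for o in range(Omega[a].shape[1]):
        b_next, p_o = belief_update(b, a, o, T, Omega)
        if p_o > 0:
            q += gamma * p_o * V(b_next)
    return q

def improve_policy(B, V, R, T, Omega, gamma):
    """Eq. (8): greedy action pi'(b) = argmax_a Q^pi(b, a) at every sampled point b."""
    n_actions = R.shape[1]
    return [max(range(n_actions), key=lambda a: q_value(b, a, V, R, T, Omega, gamma))
            for b in B]
```

By Theorem 2, the returned policy is no worse than the old one at every point of the sampled belief set, which is the monotone improvement that point-based policy iteration relies on.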
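The policy-evaluation half of ILSPI, a basis representation of V^π whose weights minimize the Bellman residual, is described only at a high level here, and the incremental basis-selection rule is not given. The sketch below therefore shows just a least-squares weight fit for a fixed, hypothetical set of Gaussian basis functions over the sampled belief set (again reusing belief_update); an incremental scheme in the spirit of ILSPI would wrap this fit in an outer loop that adds, one at a time, the candidate basis function that reduces the residual most.

```python
import numpy as np

def fit_value_function(B, policy, R, T, Omega, gamma, centers, width=0.5):
    """Least-squares fit of V^pi(b) ~= w . phi(b) that minimizes the Bellman residual
    over the sampled belief set B, using hypothetical Gaussian bases at `centers`."""
    centers = np.asarray(centers)

    def phi(b):
        # one Gaussian feature per basis center
        return np.exp(-np.sum((centers - b) ** 2, axis=1) / (2.0 * width ** 2))

    features, next_features, rewards = [], [], []
    for b, a in zip(B, policy):
        features.append(phi(b))
        rewards.append(float(b @ R[:, a]))
        exp_phi = np.zeros(len(centers))            # sum_o p(o|b,a) phi(b~_o)
        for o in range(Omega[a].shape[1]):
            b_next, p_o = belief_update(b, a, o, T, Omega)
            if p_o > 0:
                exp_phi += p_o * phi(b_next)
        next_features.append(exp_phi)

    Phi = np.array(features)
    A = Phi - gamma * np.array(next_features)       # Bellman residual is A @ w - rewards
    w, *_ = np.linalg.lstsq(A, np.array(rewards), rcond=None)
    return lambda b: float(phi(b) @ w)              # V(b), usable by q_value above
```

Alternating fit_value_function with improve_policy gives a minimal point-based policy-iteration loop consistent with the description above, though the actual ILSPI differs in how the basis functions are chosen and computed.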